Abstract: The problem of statistical learning is to construct
a predictor of a random variable Y as a function of a related
random variable X on the basis of an i.i.d. training sample from
the joint distribution of (X, Y). Allowable predictors are drawn
from some specified class, and the goal is to approach
asymptotically the performance (expected loss) of the best predictor
in the class. We consider the setting in which one has perfect
observation of the X-part of the sample, while the Y-part has
to be communicated at some finite bit rate. The encoding of the
Y-values is allowed to depend on the X-values. Under suitable
regularity conditions on the admissible predictors, the underlying 
family of probability distributions and the loss function, we give 
an information-theoretic characterization of achievable predictor 
performance in terms of conditional distortion-rate functions. 
The ideas are illustrated on the example of nonparametric 
regression in Gaussian noise. 

I. Introduction and problem statement 

Let $X$ and $Y$ be jointly distributed random variables, where
$X$ takes values in an input space $\mathcal{X}$ and $Y$ takes values
in an output space $\mathcal{Y}$. The problem of statistical learning is
about constructing an accurate predictor of $Y$ as a function of $X$ on
the basis of some number of independent copies of $(X, Y)$,
often with very little or no prior knowledge of the underlying
distribution. A very general decision-theoretic framework for
learning was proposed by Haussler [1]. In a slightly simplified
form it goes as follows. Let $\mathcal{P}$ be a family of probability
distributions on $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. Each member
$P$ of $\mathcal{P}$ represents a possible relationship between $X$ and $Y$.
Also given are a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$
and a set $\mathcal{F}$ of functions (hypotheses) from $\mathcal{X}$ into
$\mathcal{Y}$. For any $f \in \mathcal{F}$ and any $P \in \mathcal{P}$
we have the expected loss (or risk)

\[ L(f, P) = \mathbb{E}\,\ell(f(X), Y) \equiv \int \ell(f(x), y)\, dP(x, y), \]

which expresses quantitatively the average performance of $f$
as a predictor of $Y$ from $X$ when $(X, Y) \sim P$. Let us define
the minimum expected loss

\[ L^*(\mathcal{F}, P) = \inf_{f \in \mathcal{F}} L(f, P) \]

and assume that the infimum is achieved by some $f^* \in \mathcal{F}$.
Then $f^*$ is the best predictor of $Y$ from $X$ in the hypothesis
class $\mathcal{F}$ when $(X, Y) \sim P$. The problem of statistical learning
is to construct, for each $n$, an approximation to $f^*$ on the
basis of a training sequence $\{Z_i\}_{i=1}^n$, where $Z_i = (X_i, Y_i)$
are i.i.d. according to $P$, such that this approximation gets
better and better as the sample size $n$ tends to infinity. This
formulation of the learning problem is referred to as agnostic
(or model-free) learning, reflecting the fact that typically only
minimal assumptions are made on the causal relation between
$X$ and $Y$ and on the capability of the hypotheses in $\mathcal{F}$
to capture this relation. It is general enough to cover such
problems as classification, regression and density estimation.

Formally, a learning algorithm (or learner, for short) is a
sequence $\{\hat{f}_n\}_{n=1}^\infty$ of maps
$\hat{f}_n : \mathcal{Z}^n \times \mathcal{X} \to \mathcal{Y}$, such that
$\hat{f}_n(Z^n, \cdot) \in \mathcal{F}$ for all $n$ and all $Z^n \in \mathcal{Z}^n$.
Let $Z = (X, Y) \sim P$ be independent of the training sequence $Z^n$. The main
quantity of interest is the generalization error of the learner,

\[ L(\hat{f}_n, P) \triangleq \mathbb{E}\big[ \ell(\hat{f}_n(Z^n, X), Y) \,\big|\, Z^n \big] \equiv \int \ell(\hat{f}_n(Z^n, x), y)\, dP(x, y). \]

The generalization error is a random variable, as it depends
on the training sequence $Z^n$. One is chiefly interested in the
asymptotic probabilistic behavior of the excess loss
$L(\hat{f}_n, P) - L^*(\mathcal{F}, P)$ as $n \to \infty$. (Clearly,
$L(\hat{f}_n, P) \ge L^*(\mathcal{F}, P)$ for every $n$.) Under suitable
conditions on the loss function $\ell$, the hypothesis class $\mathcal{F}$,
and the underlying family $\mathcal{P}$ of probability distributions, one can
show that there exist learning algorithms which not only generalize, i.e.,
$\mathbb{E} L(\hat{f}_n, P) \to L^*(\mathcal{F}, P)$ as $n \to \infty$ for every
$P \in \mathcal{P}$ (which is the least one could ask for), but are also
probably approximately correct (PAC), i.e.,

\[ \lim_{n \to \infty} P\big( Z^n : L(\hat{f}_n, P) > L^*(\mathcal{F}, P) + \epsilon \big) = 0 \tag{1} \]

for every $\epsilon > 0$ and every $P \in \mathcal{P}$. (See, e.g., Vidyasagar [2].)

This formulation assumes that the training data are available 
to the learner with arbitrary precision. This assumption may 
not always hold, however. For example, the location at which 
the training data are gathered may be geographically separated 
from the location where the learning actually takes place. 
Therefore, the training data may have to be communicated to 
the learner over a channel of finite capacity. In that case, the 
learner will see only a quantized version of the training data, 
and must be able to cope with this to the extent allowed by the 
fundamental limitations imposed by rate-distortion theory. In 
this paper, we consider a special case of such learning under 
rate constraints, when the learner has perfect observation of the 
input part $X^n = (X_1, \ldots, X_n)$ of the training sequence, while
the output part $Y^n = (Y_1, \ldots, Y_n)$ has to be communicated
via a noiseless digital channel whose capacity is $R$ bits per
sample. This situation, shown in Figure 1, may arise, for
example, in remote sensing, where the $X_i$'s are the locations
of the sensors and the $Y_i$'s are the measurements of the sensors
having the form $f_0(X_i) + Z_i$, where $f_0 : \mathcal{X} \to [0, 1]$ is some
unknown function and the $Z_i$'s are i.i.d. zero-mean Gaussian
random variables with variance $\sigma^2$. Assuming that the sensors
are dispersed at random over some bounded spatial region
$\mathcal{X}$ and the location of each sensor is known following its
deployment, the task of the sensor array is to deliver, over a
rate-limited channel, an approximation $\hat{Y}^n$ of the measurement
vector $Y^n = (Y_1, \ldots, Y_n)$ to some central location, where the
vector $X^n$ of the sensor locations and the compressed version
$\hat{Y}^n$ of the sensor measurements will be fed into a learner that
will approximate $f_0$ by some function $\hat{f}_n(X^n, \hat{Y}^n, \cdot)$ from a
given hypothesis class $\mathcal{F}$.

Fig. 1. The set-up for learning from compressed data with side information.

In this paper, we establish information-theoretic upper 
bounds on the achievable generalization error in this setting. 
In particular, we relate the problem of agnostic learning under 
(partial) rate constraints to conditional rate-distortion theory 
[3, Section 6.1], [4], [5, Appendix A], which is concerned with 
lossy source coding in the presence of side information both at 
the encoder and at the decoder. In the set-up shown in Figure 1,
the input part $X^n = (X_1, \ldots, X_n)$ of the training sequence,
which is available both to the encoder and to the decoder
(hence to the learner), plays the role of the side information,
while the output part $Y^n = (Y_1, \ldots, Y_n)$ is to be coded using
a lossy source code operating at the rate of $R$ bits per symbol.
Furthermore, because the distribution of $(X, Y)$ is known only
to be a member of some family $\mathcal{P}$, the lossy codes must be
robust in the presence of this uncertainty.

Let us formally state the problem. Let $\mathcal{P}$, $\mathcal{F}$, $\ell$ be given.
A scheme for agnostic learning under partial rate constraints
(from now on, simply a scheme) operating at rate $R$ is
specified by a sequence of triples $\{(e_n, d_n, \hat{f}_n)\}_{n=1}^\infty$, where
$e_n : \mathcal{X}^n \times \mathcal{Y}^n \to \{1, \ldots, 2^{nR}\}$ is the encoder,
$d_n : \mathcal{X}^n \times \{1, \ldots, 2^{nR}\} \to \mathcal{Y}^n$ is the decoder, and
$\hat{f}_n : \mathcal{X}^n \times \mathcal{Y}^n \to \mathcal{F}$ is the learner.
We shall often abuse notation and let $\hat{f}_n$ denote
also the function $\hat{f}_n(X^n, \hat{Y}^n, \cdot)$. For each $n$, the output of
the learner is a hypothesis $\hat{f}_n(X^n, \hat{Y}^n, \cdot) \in \mathcal{F}$, where
$\hat{Y}^n = d_n(X^n, e_n(X^n, Y^n))$ is the reproduction of $Y^n$ given the side
information $X^n$. For any $P \in \mathcal{P}$, the main object of interest
associated with the scheme is the generalization error

\[ L(\hat{f}_n, P) = \mathbb{E}\big[ \ell(\hat{f}_n(X^n, \hat{Y}^n, X), Y) \,\big|\, X^n, Y^n \big], \]

where $(X, Y) \sim P$ is assumed independent of $\{(X_i, Y_i)\}_{i=1}^n$
(to keep the notation simple, we suppress the dependence
of the generalization error on the encoder and the decoder).
In particular, we are interested in the achievable values of
the asymptotic expected excess risk. We say that a pair
$(R, \Delta)$ is achievable for $(\mathcal{F}, \mathcal{P}, \ell)$ if there exists a scheme
$\{(e_n, d_n, \hat{f}_n)\}_{n=1}^\infty$ operating at rate $R$, such that

\[ \limsup_{n \to \infty} \mathbb{E}\, L(\hat{f}_n, P) \le L^*(\mathcal{F}, P) + \Delta \]

for every $P \in \mathcal{P}$. After listing the basic assumptions in
Sec. II, we derive in Sec. III sufficient conditions for $(R, \Delta)$
to be achievable. We then apply our results to the setting of
nonparametric regression in Sec. IV. Discussion of results and
an outline of future directions are given in Sec. V.

A. Related work 

Previously, the problem of statistical estimation from compressed
data was considered by Zhang and Berger [6],
Ahlswede and Burnashev [7] and Han and Amari [8] from
the viewpoint of multiterminal information theory. In these
papers, the underlying family of distributions of $(X, Y)$ is
parametric, i.e., of the form $\mathcal{P} = \{P_\theta\}_{\theta \in \Theta}$, where
$\Theta$ is a subset of $\mathbb{R}^k$ for some finite $k$, and one wishes to estimate
the "true" parameter $\theta^*$. The i.i.d. observations $\{(X_i, Y_i)\}_{i=1}^n$
are drawn from $P_{\theta^*}$, and the input part $X^n$ is communicated
to the statistician at some rate $R_1$, while the output part
$Y^n$ is communicated at some rate $R_2$. The present work
generalizes to the nonparametric setting the case considered
by Ahlswede and Burnashev [7], namely when $R_1 = \infty$.
To the best of the author's knowledge, this paper is the 
first to consider the problem of nonparametric learning from 
compressed observations with side information. 

II. Assumptions 

We begin by stating some basic assumptions on $\mathcal{F}$, $\mathcal{P}$ and $\ell$.
Additional assumptions will be listed in the sequel as needed.

The input space $\mathcal{X}$ is taken to be a measurable subset of $\mathbb{R}^d$,
while the output space $\mathcal{Y}$ is either a finite set (as in classification)
or the set of reals $\mathbb{R}$ (as in regression or function estimation).
We assume throughout that the family $\mathcal{P}$ of distributions on
$\mathcal{X} \times \mathcal{Y}$ is such that the mutual information $I(X; Y) < \infty$ for
every $P \in \mathcal{P}$. All information-theoretic quantities will be in
bits, unless specified otherwise.

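We assume that there exists a learning algorithm which
generalizes optimally in the absence of any rate constraints.
Therefore, our standing assumption on $(\mathcal{F}, \mathcal{P}, \ell)$ will be that
the induced function class $\mathcal{L}_{\mathcal{F}} = \{\ell_f : f \in \mathcal{F}\}$, where
$\ell_f(z) = \ell(f(x), y)$ for all $z = (x, y) \in \mathcal{Z}$, satisfies the
uniform law of large numbers (ULLN) for every $P \in \mathcal{P}$, i.e.,

\[ \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n \ell_f(Z_i) - \mathbb{E}\,\ell_f(Z) \right| \to 0, \quad \text{a.s.} \tag{2} \]

where $Z, Z_1, Z_2, \ldots$ are i.i.d. according to $P$. Eq. (2) implies
that, for any sequence $\{f_n\} \subset \mathcal{F}$,

\[ \left| \frac{1}{n} \sum_{i=1}^n \ell_{f_n}(Z_i) - L(f_n, P) \right| \to 0, \quad \text{a.s.} \]

This holds even in the case when each $f_n$ is random, i.e.,
$f_n(\cdot) = f_n(Z^n, \cdot)$. The ULLN is a standard ingredient in
proofs of consistency of learning algorithms: if $(\mathcal{F}, \mathcal{P}, \ell)$ are
such that (2) holds, then the Empirical Risk Minimization
algorithm (ERM), given by

\[ \hat{f}_n = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \ell_f(Z_i), \]

is PAC in the sense of (1) [2, Theorem 3.2].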

Next, we assume that the loss function $\ell$ has the following
"generalized Lipschitz" property: there exists a concave,
continuous function $\eta : \mathbb{R}^+ \to \mathbb{R}^+$, such that for all
$f \in \mathcal{F}$, $x \in \mathcal{X}$ and $u, u' \in \mathcal{Y}$

\[ |\ell(f(x), u) - \ell(f(x), u')| \le \eta(\ell(u, u')). \tag{3} \]

This holds, for example, in the following cases:

• Suppose that $\ell$ is a metric on $\mathcal{Y}$. Then, by the triangle
inequality we have $\ell(y, u) \le \ell(y, u') + \ell(u', u)$ for all
$y, u, u' \in \mathcal{Y}$, so (3) holds with $\eta(t) = t$.

• Suppose that $\mathcal{Y} = [0, 1]$ and $\ell(u, u') = |u - u'|^p$ for some
$p \ge 1$. Then one can show that

\[ |\ell(f(x), u) - \ell(f(x), u')| \le p\,|u - u'| \]

for all $f : \mathcal{X} \to \mathcal{Y}$, $x \in \mathcal{X}$ and $u, u' \in \mathcal{Y}$, so (3) holds
with $\eta(t) = p\,t^{1/p}$.
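The second example can be sanity-checked numerically. The sketch below is illustrative only: the function names (`loss`, `eta`, `max_violation`) and the random-sampling approach are assumptions introduced here, not part of the original argument. It samples points of $[0, 1]$ and reports the largest observed value of the left-hand side of (3) minus the right-hand side with $\eta(t) = p\,t^{1/p}$; a nonpositive value (up to rounding) is consistent with the claim, although it does not prove it.

```python
import random


def loss(u, v, p):
    """Power loss |u - v|^p on the unit interval."""
    return abs(u - v) ** p


def eta(t, p):
    """Candidate modulus eta(t) = p * t^(1/p) from the second example."""
    return p * t ** (1.0 / p)


def max_violation(p, trials=20000, seed=0):
    """Largest observed value of |l(v,u) - l(v,u')| - eta(l(u,u')).

    Here v stands in for f(x). A nonpositive result (up to floating-point
    rounding) is consistent with property (3) for the power loss.
    """
    rng = random.Random(seed)
    worst = float("-inf")
    for _ in range(trials):
        v, u, u2 = rng.random(), rng.random(), rng.random()
        lhs = abs(loss(v, u, p) - loss(v, u2, p))
        rhs = eta(loss(u, u2, p), p)
        worst = max(worst, lhs - rhs)
    return worst


if __name__ == "__main__":
    for p in (1, 2, 3.5):
        print(p, max_violation(p))
```

For $p = 1$ the bound can hold with equality, so the reported value may sit at the level of rounding error rather than strictly below zero.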
Finally, we need to pose some assumptions on the metric
structure of the class $\mathcal{P}$ with respect to the variational distance
[9, Sec. 5.2], which for any two probability distributions
$P_1, P_2$ on a measurable space $(\mathcal{Z}, \mathcal{A})$ is defined by

\[ d_V(P_1, P_2) = 2 \sup_{A \in \mathcal{A}} |P_1(A) - P_2(A)|. \]

A finite set $\{P_1, \ldots, P_M\} \subset \mathcal{P}$ is called an (internal) $\epsilon$-net
for $\mathcal{P}$ with respect to $d_V$ if

\[ \sup_{P \in \mathcal{P}} \min_{1 \le m \le M} d_V(P, P_m) \le \epsilon. \]

The cardinality of a minimal $\epsilon$-net, denoted by $N(\epsilon, \mathcal{P})$, is
called the $\epsilon$-covering number of $\mathcal{P}$ w.r.t. $d_V$, and the
Kolmogorov $\epsilon$-entropy of $\mathcal{P}$ is defined as
$H(\epsilon, \mathcal{P}) = \log N(\epsilon, \mathcal{P})$
[10]. We assume that the class $\mathcal{P}$ satisfies Dobrushin's entropy
condition [11], i.e., for every $c > 0$

\[ \lim_{\epsilon \to 0} \frac{H(\epsilon, \mathcal{P})}{2^{c/\epsilon}} = 0. \tag{4} \]

This condition is satisfied, for example, in the following cases:
(1) $\mathcal{X}$ and $\mathcal{Y}$ are both finite sets; (2) $\mathcal{P}$ is a finite family; (3)
$\mathcal{Z}$ is a compact subset of a Euclidean space, and all $P \in \mathcal{P}$
are absolutely continuous with densities satisfying a uniform
Lipschitz condition [10], [11].
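To make the notion of an internal $\epsilon$-net concrete, the following sketch builds one greedily for a finite family of distributions on a finite space, where the variational distance reduces to the $L_1$ distance between probability vectors. The helper names (`variational_distance`, `greedy_net`) and the Bernoulli family are hypothetical choices made for illustration. A greedy net is a valid $\epsilon$-net but need not be minimal, so its size only upper-bounds $N(\epsilon, \mathcal{P})$.

```python
def variational_distance(p, q):
    """d_V(P1, P2) = 2 sup_A |P1(A) - P2(A)|.

    On a finite space this equals the L1 distance between the
    probability vectors p and q.
    """
    return sum(abs(a - b) for a, b in zip(p, q))


def greedy_net(family, eps):
    """Return indices of an internal eps-net of `family`, chosen greedily.

    A distribution becomes a new center only if it is farther than eps
    from every center chosen so far, so each member of the family ends
    up within eps of some center. The net is not guaranteed minimal.
    """
    centers = []
    for i, p in enumerate(family):
        if all(variational_distance(p, family[c]) > eps for c in centers):
            centers.append(i)
    return centers


if __name__ == "__main__":
    # Illustrative family: Bernoulli(theta) for theta = 0, 0.05, ..., 1.
    family = [(1 - k / 20, k / 20) for k in range(21)]
    net = greedy_net(family, eps=0.25)
    print(len(net), net)
```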

III. The results 

To state our results we shall need some notions from conditional
rate-distortion theory [3, Sec. 6.1], [4], [5, Appendix A].
Fix some $P \in \mathcal{P}$. Given a pair $(X, Y) \sim P$ and a nonnegative
real number $D$, define the set $\mathcal{M}(D)$ to consist of all
$\mathcal{Y}$-valued random variables $\hat{Y}$ jointly distributed with $(X, Y)$ and
satisfying the constraint $\mathbb{E}\,\ell(Y, \hat{Y}) \le D$, where the expectation
is taken with respect to the joint distribution of $X, Y, \hat{Y}$. Then
the conditional rate-distortion function of $Y$ given $X$ w.r.t. $P$
is defined by

\[ R_{Y|X}(D, P) \triangleq \inf\big\{ I(Y; \hat{Y} | X) : \hat{Y} \in \mathcal{M}(D) \big\}, \]

where $I(Y; \hat{Y} | X)$ is the conditional mutual information between
$Y$ and $\hat{Y}$ given $X$. Our assumption that $I(X; Y) < \infty$
ensures the existence of $R_{Y|X}(D, P)$ [5]. In operational
terms, $R_{Y|X}(D, P)$ is the minimum number of bits needed
to describe $Y$ with expected distortion of at most $D$ given
perfect observation of a correlated random variable $X$ (the
side information) when $(X, Y) \sim P$. As a function of $D$,
$R_{Y|X}(D, P)$ is convex and strictly decreasing everywhere it
is finite, hence it is invertible. The inverse function is called
the conditional distortion-rate function of $Y$ given $X$ and is
denoted by $D_{Y|X}(R, P)$. Finally, let

\[ \mathbb{D}_{Y|X}(R, \mathcal{P}) = \sup_{P \in \mathcal{P}} D_{Y|X}(R, P). \]

We assume that $\mathbb{D}_{Y|X}(R, \mathcal{P}) < \infty$ for all $R > 0$.

We shall also need the following lemma, which can be 
proved by a straightforward extension of Dobrushin's random 
coding argument from [11] to the case of side information 
available to the encoder and to the decoder: 

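Lemma 3.1. Let $\mathcal{P}$ satisfy Dobrushin's entropy condition (4).
Assume that the loss function $\ell$ either is bounded or satisfies
a uniform moment condition

\[ \sup_{P \in \mathcal{P}} \mathbb{E}\big[ \ell(Y, y_0)^{1+\delta} \big] < \infty \tag{5} \]

for some $\delta > 0$ with respect to some fixed reference letter
$y_0 \in \mathcal{Y}$. Then for every rate $R > 0$ there exists a sequence
$\{(e_n, d_n)\}_{n=1}^\infty$ of encoders
$e_n : \mathcal{X}^n \times \mathcal{Y}^n \to \{1, \ldots, 2^{nR}\}$ and
decoders $d_n : \mathcal{X}^n \times \{1, \ldots, 2^{nR}\} \to \mathcal{Y}^n$, such that

\[ \limsup_{n \to \infty} \sup_{P \in \mathcal{P}} \mathbb{E}\,\ell_n(Y^n, \hat{Y}^n) \le \mathbb{D}_{Y|X}(R, \mathcal{P}), \]

where $\hat{Y}^n = d_n(X^n, e_n(X^n, Y^n))$ and
$\ell_n(Y^n, \hat{Y}^n) = n^{-1} \sum_{i=1}^n \ell(Y_i, \hat{Y}_i)$ is the
normalized cumulative loss between $Y^n$ and $\hat{Y}^n$.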

Our main result can then be stated as follows: 

Theorem 3.1. Under the stated assumptions, for any $R > 0$
there exists a scheme $\{(e_n, d_n, \hat{f}_n)\}$ operating at rate $R$, such
that

\[ \limsup_{n \to \infty} \mathbb{E}\, L(\hat{f}_n, P) \le L^*(\mathcal{F}, P) + 2\eta\big( \mathbb{D}_{Y|X}(R, \mathcal{P}) \big). \]

Thus, $(R, 2\eta(\mathbb{D}_{Y|X}(R, \mathcal{P})))$ is achievable for every $R > 0$.

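Proof: Given $n$, $Z^n \in \mathcal{Z}^n$ and $f \in \mathcal{F}$, define the
empirical risk

\[ \hat{L}_{Z^n}(f) \triangleq \frac{1}{n} \sum_{i=1}^n \ell_f(Z_i) \]

and the minimum empirical risk

\[ \hat{L}^*_{Z^n}(\mathcal{F}) \triangleq \inf_{f \in \mathcal{F}} \hat{L}_{Z^n}(f). \]

We shall write $\hat{L}_{X^n, Y^n}(f)$ and $\hat{L}^*_{X^n, Y^n}(\mathcal{F})$ whenever we need
to emphasize separately the roles of $X^n$ and $Y^n$.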

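Suppose that the encoder $e_n$ and the decoder $d_n$ are given.
Let $\hat{Y}^n$ denote the reproduction of $Y^n$ given the side
information $X^n$, i.e., $\hat{Y}^n = d_n(X^n, e_n(X^n, Y^n))$. We then define
our learner $\hat{f}_n$ by

\[ \hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{L}_{X^n, \hat{Y}^n}(f). \tag{6} \]

In other words, having received the side information $X^n$ and
the reproduction $\hat{Y}^n$, the learner performs ERM over $\mathcal{F}$ on
$\{(X_i, \hat{Y}_i)\}_{i=1}^n$. Using the property (3) of the loss function $\ell$
and the concavity of $\eta$, we have the following estimate:

\[
\begin{aligned}
\sup_{f \in \mathcal{F}} \big| \hat{L}_{X^n, Y^n}(f) - \hat{L}_{X^n, \hat{Y}^n}(f) \big|
 &\le \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \big| \ell(f(X_i), Y_i) - \ell(f(X_i), \hat{Y}_i) \big| \\
 &\le \frac{1}{n} \sum_{i=1}^n \eta\big( \ell(Y_i, \hat{Y}_i) \big) \\
 &\le \eta\big( \ell_n(Y^n, \hat{Y}^n) \big).
\end{aligned}
\tag{7}
\]

In particular, this implies that

\[ \big| \hat{L}_{X^n, Y^n}(\hat{f}_n) - \hat{L}_{X^n, \hat{Y}^n}(\hat{f}_n) \big| \le \eta\big( \ell_n(Y^n, \hat{Y}^n) \big) \tag{8} \]

and

\[ \big| \hat{L}^*_{X^n, Y^n}(\mathcal{F}) - \hat{L}^*_{X^n, \hat{Y}^n}(\mathcal{F}) \big| \le \eta\big( \ell_n(Y^n, \hat{Y}^n) \big). \tag{9} \]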



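We then have

\[
\begin{aligned}
\hat{L}_{X^n, Y^n}(\hat{f}_n)
 &\overset{(a)}{\le} \hat{L}_{X^n, \hat{Y}^n}(\hat{f}_n) + \eta\big( \ell_n(Y^n, \hat{Y}^n) \big) \\
 &\overset{(b)}{=} \hat{L}^*_{X^n, \hat{Y}^n}(\mathcal{F}) + \eta\big( \ell_n(Y^n, \hat{Y}^n) \big) \\
 &\overset{(c)}{\le} \hat{L}^*_{X^n, Y^n}(\mathcal{F}) + 2\eta\big( \ell_n(Y^n, \hat{Y}^n) \big),
\end{aligned}
\]

where (a) follows from (8), (b) from the definition of $\hat{f}_n$, and
(c) from (9). Suppose that the data are distributed according
to a particular $P \in \mathcal{P}$. Taking expectations and using the
concavity of $\eta$ and Jensen's inequality, we obtain

\[ \mathbb{E}\, \hat{L}_{Z^n}(\hat{f}_n) \le \mathbb{E}\, \hat{L}^*_{Z^n}(\mathcal{F}) + 2\eta\big( \mathbb{E}\,\ell_n(Y^n, \hat{Y}^n) \big). \]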

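Using this bound and the continuity of $\eta$, we can write

\[
\begin{aligned}
\limsup_{n \to \infty} \mathbb{E}\, L(\hat{f}_n, P) - L^*(\mathcal{F}, P)
 \le{} & \lim_{n \to \infty} \mathbb{E}\big[ L(\hat{f}_n, P) - \hat{L}_{Z^n}(\hat{f}_n) \big] \\
 & + \lim_{n \to \infty} \mathbb{E}\big[ \hat{L}^*_{Z^n}(\mathcal{F}) - L^*(\mathcal{F}, P) \big] \\
 & + 2\eta\Big( \limsup_{n \to \infty} \mathbb{E}\,\ell_n(Y^n, \hat{Y}^n) \Big).
\end{aligned}
\tag{10}
\]

The two leading terms on the right-hand side of this inequality
are zero by the ULLN. Moreover, given $R$, Lemma 3.1
asserts the existence of a sequence $\{(e_n, d_n)\}_{n=1}^\infty$ of encoders
$e_n : \mathcal{X}^n \times \mathcal{Y}^n \to \{1, \ldots, 2^{nR}\}$ and decoders
$d_n : \mathcal{X}^n \times \{1, \ldots, 2^{nR}\} \to \mathcal{Y}^n$, such that

\[ \limsup_{n \to \infty} \mathbb{E}\,\ell_n(Y^n, \hat{Y}^n) \le \mathbb{D}_{Y|X}(R, \mathcal{P}), \quad \forall P \in \mathcal{P}. \]

Substitution of this into (10) proves the theorem. ■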
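The structure of the scheme in the proof (encode and decode the outputs, then run ERM on the inputs paired with the reproduced outputs) can be sketched in a few lines. Everything specific below is an assumption made for illustration: the per-sample uniform scalar quantizer `quantize` is a hypothetical stand-in for the block code $(e_n, d_n)$ of Lemma 3.1 and makes no use of the side information, the hypothesis class is a small finite set of linear functions, and the squared loss is borrowed from the regression example of Sec. IV.

```python
import random


def quantize(y, rate_bits, lo=-1.0, hi=2.0):
    """Hypothetical stand-in for (e_n, d_n): clip y to [lo, hi] and map it
    to the midpoint of one of 2**rate_bits equal cells. This is a simple
    per-sample scalar quantizer, not the block code of Lemma 3.1."""
    levels = 2 ** rate_bits
    width = (hi - lo) / levels
    cell = min(levels - 1, max(0, int((y - lo) / width)))
    return lo + (cell + 0.5) * width


def empirical_risk(f, xs, ys):
    """Average squared loss of hypothesis f on the pairs (x_i, y_i)."""
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def erm(hypotheses, xs, ys):
    """Empirical risk minimization over a finite hypothesis class."""
    return min(hypotheses, key=lambda f: empirical_risk(f, xs, ys))


def run_scheme(n=2000, rate_bits=3, sigma=0.1, seed=0):
    """Simulate data, compress the outputs, and learn from (X_i, Yhat_i).

    Returns the slope of the hypothesis selected by ERM.
    """
    rng = random.Random(seed)
    # Illustrative finite class: f_a(x) = a * x with a in {0, 0.1, ..., 1}.
    slopes = [k / 10 for k in range(11)]
    hypotheses = [lambda x, a=a: a * x for a in slopes]
    true_slope = 0.7
    xs = [rng.random() for _ in range(n)]
    ys = [true_slope * x + rng.gauss(0.0, sigma) for x in xs]
    ys_hat = [quantize(y, rate_bits) for y in ys]  # what the learner sees
    f_hat = erm(hypotheses, xs, ys_hat)
    return f_hat(1.0)


if __name__ == "__main__":
    for bits in (1, 3, 12):
        print(bits, run_scheme(rate_bits=bits))
```

The learner never touches the encoder or decoder: changing the hypothesis class only changes the `hypotheses` list, which mirrors the modular structure of the proof.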

Corollary 3.2. All pairs $(R, \Delta)$ with $\Delta \ge 2\eta(\mathbb{D}_{Y|X}(R, \mathcal{P}))$
are achievable.

Remark 3.1. In the Appendix, we show that a corresponding
lower bound derived by the usual methods for proving
converses in lossy source coding is strictly weaker than
the "obvious" lower bound based on the observation that
$\mathbb{E}\, L(\hat{f}_n, P) \ge L^*(\mathcal{F}, P)$ for any $\hat{f}_n$. It may be possible to
obtain nontrivial lower bounds in the minimax setting, which
we leave for future work (see also Sec. V).

Remark 3.2. Under some technical conditions on the function
class $\{\ell_f : f \in \mathcal{F}\}$ (see, e.g., [12]), one can show that

\[ \mathbb{E} \sup_{f \in \mathcal{F}} \big| \hat{L}_{Z^n}(f) - L(f, P) \big| \le C/\sqrt{n}, \quad \forall P \in \mathcal{P} \]

for some constant $C$ that depends on $\mathcal{F}$, $\ell$. Using this fact and
the same bounding method that led to Eq. (10), but without
taking the limit superior, we can get the following finite-sample
bound for every scheme $\{(e_n, d_n, \hat{f}_n)\}_{n=1}^\infty$ with $\hat{f}_n$
given by (6) and arbitrary $e_n, d_n$:

\[ \mathbb{E}\, L(\hat{f}_n, P) \le L^*(\mathcal{F}, P) + 2\eta\big( \mathbb{E}\,\ell_n(Y^n, \hat{Y}^n) \big) + C'/\sqrt{n}, \]

where $C' = 2C$.

The following theorem shows that we can replace condition
(3) with the requirement that $\ell$ be a power of a metric:

Theorem 3.3. Suppose that the loss function $\ell$ is of the form
$\ell(y, u) = d(y, u)^r$ for some $r \ge 1$, where $d$ is a metric on $\mathcal{Y}$.
Then for any rate $R > 0$ the scheme constructed in the proof
of Theorem 3.1 is such that

\[ \limsup_{n \to \infty} \mathbb{E}\big[ L(\hat{f}_n, P)^{1/r} \big] \le L^*(\mathcal{F}, P)^{1/r} + 2\,\mathbb{D}_{Y|X}(R, \mathcal{P})^{1/r} \]

holds for every $P \in \mathcal{P}$.



Proof: We proceed essentially along the same lines as in
the proof of Theorem 3.1, except that the bound (7) is replaced
with an argument based on Minkowski's inequality to yield

\[ \mathbb{E}\big[ \hat{L}_{Z^n}(\hat{f}_n)^{1/r} \big] \le \mathbb{E}\big[ \hat{L}^*_{Z^n}(\mathcal{F})^{1/r} \big] + 2 \big[ \mathbb{E}\,\ell_n(Y^n, \hat{Y}^n) \big]^{1/r}. \]

The rest is immediate using the ULLN as well as concavity
and continuity of $t \mapsto t^{1/r}$ for $t \ge 0$. ■
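The Minkowski step is only named in the proof. One way to fill it in, offered as a reconstruction from the stated assumptions rather than as the original argument, is the following. The hatted symbols denote the empirical risk and its minimum over $\mathcal{F}$, as defined in the proof of Theorem 3.1.

```latex
% For any fixed f, with \ell = d^r: triangle inequality for d, then
% Minkowski's inequality for the empirical L^r norm.
\begin{align*}
\hat{L}_{X^n,Y^n}(f)^{1/r}
  &= \Big( \frac{1}{n} \sum_{i=1}^n d(f(X_i), Y_i)^r \Big)^{1/r} \\
  &\le \Big( \frac{1}{n} \sum_{i=1}^n
        \big[ d(f(X_i), \hat{Y}_i) + d(\hat{Y}_i, Y_i) \big]^r \Big)^{1/r} \\
  &\le \hat{L}_{X^n,\hat{Y}^n}(f)^{1/r} + \ell_n(Y^n, \hat{Y}^n)^{1/r}.
\end{align*}
% The same bound holds with Y^n and \hat{Y}^n interchanged. Applying it to
% f = \hat{f}_n, and then taking the infimum over f, gives
\begin{align*}
\hat{L}_{X^n,Y^n}(\hat{f}_n)^{1/r}
  &\le \hat{L}^*_{X^n,\hat{Y}^n}(\mathcal{F})^{1/r} + \ell_n(Y^n, \hat{Y}^n)^{1/r} \\
  &\le \hat{L}^*_{X^n,Y^n}(\mathcal{F})^{1/r} + 2\, \ell_n(Y^n, \hat{Y}^n)^{1/r}.
\end{align*}
% Taking expectations and using Jensen's inequality for the concave map
% t -> t^{1/r}:
%   E[ \ell_n(Y^n, \hat{Y}^n)^{1/r} ] <= ( E \ell_n(Y^n, \hat{Y}^n) )^{1/r}.
```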

IV. Example: nonparametric regression 

As an example, let us consider the setting of nonparametric
regression. Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^d$ and
$\mathcal{Y} = \mathbb{R}$. The training data are of the form

\[ Y_i = f_0(X_i) + Z_i, \quad 1 \le i \le n \tag{11} \]

where the regression function $f_0$ belongs to some specified
class $\mathcal{F}$ of functions from $\mathcal{X}$ into $[0, 1]$, the $X_i$'s are i.i.d.
random variables drawn from the uniform distribution on $\mathcal{X}$,
and the $Z_i$'s are i.i.d. zero-mean normal random variables
with variance $\sigma^2$, independent of $X^n$. We take
$\ell(y, u) = |y - u|^2$, the squared loss. Note that $\ell$ satisfies the condition
of Theorem 3.3 with $r = 2$.

Because $f_0$ is unknown, we take as the underlying family
$\mathcal{P}$ the class of all absolutely continuous distributions with
densities of the form $p_f(x, y) = V^{-1} \mathcal{N}(y; f(x), \sigma^2)$, $f \in \mathcal{F}$,
where $V$ is the volume of $\mathcal{X}$ and $\mathcal{N}(y; f(x), \sigma^2)$ is the
one-dimensional normal density with mean $f(x)$ and variance $\sigma^2$.
Because the functions in $\mathcal{F}$ are bounded between 0 and 1,
it is easy to show that the uniform moment condition (5) of
Lemma 3.1 is satisfied with $\delta = 1$ and $y_0 = 0$.

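We suppose that $\ell$ and $\mathcal{F}$ are such that the function class $\mathcal{L}_{\mathcal{F}}$
satisfies the ULLN.¹ Let $Q$ denote the uniform distribution on
$\mathcal{X}$ and for any square-integrable function $f$ on $\mathcal{X}$ define the
$L^2$ norm by

\[ \|f\|_{2,Q}^2 = \int f^2(x)\, dQ(x) = \frac{1}{V} \int_{\mathcal{X}} f^2(x)\, dx. \]

Let us denote by $N_{2,Q}(\epsilon, \mathcal{F})$ the $\epsilon$-covering number of $\mathcal{F}$ w.r.t.
$\|\cdot\|_{2,Q}$, i.e., the smallest number $M$ such that there exist $M$
functions $\{f_m\}_{m=1}^M \subset \mathcal{F}$ satisfying

\[ \sup_{f \in \mathcal{F}} \min_{1 \le m \le M} \|f - f_m\|_{2,Q} \le \epsilon. \]

We assume that $\mathcal{F}$ is such that for every $c > 0$

\[ \lim_{\epsilon \to 0} \frac{\log N_{2,Q}(\epsilon, \mathcal{F})}{2^{c/\epsilon}} = 0. \tag{12} \]

This condition holds, for example, if the functions in $\mathcal{F}$ are
uniformly Lipschitz or if $\mathcal{X}$ is a bounded interval in $\mathbb{R}$ and $\mathcal{F}$
consists of functions satisfying a Sobolev-type condition [10].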

Lemma 4.1. If $\mathcal{F}$ satisfies (12), then $\mathcal{P}$ satisfies Dobrushin's
entropy condition (4).

Proof: Given $f \in \mathcal{F}$, let $P_f$ denote the distribution with
the density $p_f$. It is straightforward to show that

\[ I(P_f \| P_g) = \frac{1}{2\sigma^2} \|f - g\|_{2,Q}^2, \]

where $I(\cdot \| \cdot)$ is the relative entropy (information divergence)
between two probability distributions, in nats. Using Pinsker's
inequality $d_V(P_1, P_2) \le \sqrt{2 I(P_1 \| P_2)}$ [9, Lemma 5.2.8], we
get

\[ d_V(P_f, P_g) \le \frac{1}{\sigma} \|f - g\|_{2,Q}, \quad \forall f, g \in \mathcal{F}. \tag{13} \]

Given $\epsilon > 0$, let $\{f_m\}_{m=1}^M \subset \mathcal{F}$ be a $\sigma\epsilon$-net for
$\mathcal{F}$ w.r.t. $\|\cdot\|_{2,Q}$. Then from (13) it follows that

\[ \sup_{f \in \mathcal{F}} \min_{1 \le m \le M} d_V(P_f, P_{f_m}) \le \sup_{f \in \mathcal{F}} \min_{1 \le m \le M} \frac{\|f - f_m\|_{2,Q}}{\sigma} \le \epsilon, \]

i.e., $\{P_{f_m}\}_{m=1}^M$ is an $\epsilon$-net for $\mathcal{P}$ w.r.t. $d_V$. This implies, in
particular, that $N(\epsilon, \mathcal{P}) \le N_{2,Q}(\sigma\epsilon, \mathcal{F})$ for every $\epsilon > 0$. This,
together with (12), proves the lemma. ■

¹ See Györfi et al. [13] for a detailed exposition of the various conditions
when this is true.
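The relative-entropy identity used in the proof is stated without derivation. A short worked version, written out here as a supplement and using only the densities $p_f(x, y) = V^{-1} \mathcal{N}(y; f(x), \sigma^2)$ defined above, runs as follows (natural logarithms, so the result is in nats).

```latex
\begin{align*}
I(P_f \| P_g)
  &= \int_{\mathcal{X}} \frac{1}{V} \int_{\mathbb{R}}
       \mathcal{N}(y; f(x), \sigma^2)\,
       \ln \frac{\mathcal{N}(y; f(x), \sigma^2)}{\mathcal{N}(y; g(x), \sigma^2)}
       \, dy \, dx \\
  &= \int_{\mathcal{X}} \frac{1}{V} \int_{\mathbb{R}}
       \mathcal{N}(y; f(x), \sigma^2)\,
       \frac{(y - g(x))^2 - (y - f(x))^2}{2\sigma^2} \, dy \, dx \\
  &= \int_{\mathcal{X}} \frac{1}{V}\,
       \frac{\big[ \sigma^2 + (f(x) - g(x))^2 \big] - \sigma^2}{2\sigma^2} \, dx \\
  &= \frac{1}{2\sigma^2} \| f - g \|_{2,Q}^2 .
\end{align*}
% Third line: for Y ~ N(f(x), sigma^2),
%   E (Y - g(x))^2 = sigma^2 + (f(x) - g(x))^2  and  E (Y - f(x))^2 = sigma^2.
```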

Lemma 4.2. For any $R > 0$, $\mathbb{D}_{Y|X}(R, \mathcal{P}) = \sigma^2 2^{-2R}$.

Proof: Fix some $f \in \mathcal{F}$ and consider a pair $(X, Y) \sim P_f$.
Then $Y = f(X) + Z$, where $Z \sim \mathrm{Normal}(0, \sigma^2)$
is independent of $X$. Because $\ell$ is a difference distortion
measure, Theorem 7 of [4] says that, for any measurable
function $\psi : \mathcal{X} \to \mathcal{Y}$,

\[ D_{Y|X}(R, P_f) = D_{Y - \psi(X)|X}(R, P_{f - \psi}), \]

where $P_{f - \psi}$ is the distribution of

\[ Y - \psi(X) = f(X) - \psi(X) + Z; \]

furthermore, if $Y - \psi(X)$ is independent of $X$, then
$D_{Y|X}(R, P_f) = D_{Y - \psi(X)}(R)$, the (unconditional) distortion-rate
function of $Y - \psi(X)$. Taking $\psi = f$, we get
$D_{Y|X}(R, P_f) = D(R, \sigma^2)$, the distortion-rate function of a
memoryless Gaussian source with variance $\sigma^2$ w.r.t. squared
error loss, which is equal to $\sigma^2 2^{-2R}$ [3, Theorem 9.3.2]. Hence
$D_{Y|X}(R, P_f)$ is independent of $f$. Taking the supremum over
$\mathcal{F}$ finishes the proof. ■
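The closed form in Lemma 4.2 and its inverse are easy to tabulate. The sketch below uses hypothetical function names and assumes rates are measured in bits per sample, as elsewhere in the paper. It evaluates the Gaussian distortion-rate function $\sigma^2 2^{-2R}$ and the corresponding rate-distortion function $\max\{0, \tfrac{1}{2}\log_2(\sigma^2/D)\}$.

```python
import math


def gaussian_distortion_rate(rate_bits, sigma2):
    """D(R) = sigma^2 * 2^(-2R): distortion-rate function of a memoryless
    Gaussian source with variance sigma2 under squared error loss."""
    return sigma2 * 2.0 ** (-2.0 * rate_bits)


def gaussian_rate_distortion(distortion, sigma2):
    """R(D) = max(0, 0.5 * log2(sigma^2 / D)) in bits; the inverse of
    D(R) on the range 0 < D <= sigma^2."""
    return max(0.0, 0.5 * math.log2(sigma2 / distortion))


if __name__ == "__main__":
    sigma2 = 4.0
    for r in (0.5, 1, 2, 4):
        d = gaussian_distortion_rate(r, sigma2)
        print(r, d, gaussian_rate_distortion(d, sigma2))
```

Each additional bit of rate divides the distortion by four, which is the source of the exponential decay of the compression penalty in Theorem 4.1.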

Now we can state and prove the main result of this section: 

Theorem 4.1. Consider the regression setting of (11). Under
the stated assumptions, for any $R > 0$ there exists a scheme
$\{(e_n, d_n, \hat{f}_n)\}_{n=1}^\infty$, such that

\[ \limsup_{n \to \infty} \mathbb{E}\big[ L(\hat{f}_n, P_f)^{1/2} \big] \le \sigma\,(1 + 2^{-R+1}) \tag{14} \]

holds for every $f \in \mathcal{F}$.

Proof: As follows from the above, the triple $(\mathcal{F}, \mathcal{P}, \ell)$
satisfies all the assumptions of Theorem 3.3. Therefore for any
$R > 0$ there exists a scheme $\{(e_n, d_n, \hat{f}_n)\}_{n=1}^\infty$ operating at
rate $R$, such that

\[ \limsup_{n \to \infty} \mathbb{E}\big[ L(\hat{f}_n, P_f)^{1/2} \big] \le L^*(\mathcal{F}, P_f)^{1/2} + 2^{-R+1} \sigma \tag{15} \]

holds for every $f \in \mathcal{F}$ (we have also used Lemma 4.2). It is
not hard to show that

\[ L(g, P_f) = \|f - g\|_{2,Q}^2 + \sigma^2, \quad \forall f, g \in \mathcal{F}, \]

whence it follows that $L^*(\mathcal{F}, P_f) = \sigma^2$ for every $f \in \mathcal{F}$.
Substituting this into (15), we get (14). ■
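To get a feel for the bound (14), the following sketch evaluates its right-hand side and the excess over the unconstrained value $L^*(\mathcal{F}, P_f)^{1/2} = \sigma$. The function names are introduced here for illustration only.

```python
def rmse_bound(rate_bits, sigma):
    """Right-hand side of (14): sigma * (1 + 2^(-R + 1))."""
    return sigma * (1.0 + 2.0 ** (1.0 - rate_bits))


def compression_penalty(rate_bits, sigma):
    """Excess of the bound over sigma, i.e. sigma * 2^(-R + 1)."""
    return rmse_bound(rate_bits, sigma) - sigma


if __name__ == "__main__":
    sigma = 0.5
    for r in (1, 2, 3, 4, 8):
        print(r, rmse_bound(r, sigma), compression_penalty(r, sigma))
```

The penalty term halves with every additional bit of rate, so the bound approaches $\sigma$ exponentially fast in $R$.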

V. Discussion and future work 

We have derived information-theoretic bounds on the 
achievable generalization error in learning from compressed 
data (with side information). There is a close relationship 
between this problem and the theory of robust lossy source 
coding with side information at the encoder and the decoder. 
A major difference between this setting and the usual setting
of learning theory is that the techniques are no longer
distribution-free because restrictions must be placed on the 
underlying family of distributions in order to guarantee the 



existence of a suitable source code. The theory was applied to 
the problem of nonparametric regression in Gaussian noise, 
where we have shown that the penalty incurred for using 
compressed observations decays exponentially with the rate. 

We have proved Theorems 3.1 and 3.3 by adopting ERM
as our learning algorithm and optimizing the source code to 
deliver the best possible reconstruction of the training data. In 
effect, this imposes a separation structure between learning 
and source coding. While this "modular" approach is simplistic 
(clearly, additional performance gains could be attained by 
designing the encoder, the decoder and the learner jointly), it 
may be justified in such applications as remote sensing. For 
instance, if the source code and the learner were designed 
jointly, then any change made to the hypothesis class (say, 
if we decided to replace the currently used hypothesis class 
with another based on tracking the prior performance of the 
network) might call for a complete redesign of the source code 
and the sensor network, which may be a costly step. With the 
modular approach, no such redesign is necessary: one merely 
makes the necessary adjustments in the learning algorithm, 
while the sensor network continues to operate as before. 

Let us close by sketching some directions for future work.
First of all, it would be of interest to derive information-theoretic
lower bounds on the generalization performance
of rate-constrained learning algorithms. In particular, just as
Ahlswede and Burnashev had done in the parametric case [7],
we could study the asymptotics of the minimax excess risk

\[ \delta_n(R) = \inf \sup_{P \in \mathcal{P}} \big[ \mathbb{E}\, L(\hat{f}_n, P) - L^*(\mathcal{F}, P) \big], \]



where the infimum is over all encoders, decoders and learners 
operating on a length-n training sequence at rate R. Secondly, 
we could dispense with the assumption that the learner has 
perfect observation of the input part of the training sample, in 
analogy to the situation dealt with by Zhang and Berger [6]. 
Finally, keeping in mind the motivating example of sensor 
networks, it would be useful to replace the block coding 
approach used here with an efficient distributed scheme. 
Appendix

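Let us assume for simplicity that $\mathcal{P}$ is a singleton, $\mathcal{P} = \{P\}$,
and that $\mathcal{Y}$ is a finite set. Consider a scheme $\{(e_n, d_n, \hat{f}_n)\}$
operating at rate $R$. Fix $n$ and define the $n$-tuple $W^n$ via

\[ W_i = \hat{f}_n(X^n, \hat{Y}^n, X_i), \quad 1 \le i \le n. \]

Also, let $J = e_n(X^n, Y^n)$. Then we can write

\[
\begin{aligned}
nR &\ge H(J | X^n) \\
   &\ge H(\hat{Y}^n | X^n) \\
   &\ge I(Y^n; \hat{Y}^n | X^n) \\
   &= H(Y^n | X^n) - H(Y^n | X^n, \hat{Y}^n) \\
   &= H(Y^n | X^n) - H(Y^n | X^n, \hat{Y}^n, W^n) && \text{(A.1)} \\
   &= \sum_{i=1}^n \big[ H(Y_i | X_i) - H(Y_i | X^n, \hat{Y}^n, W^n, Y^{i-1}) \big] \\
   &\ge \sum_{i=1}^n \big[ H(Y_i | X_i) - H(Y_i | X_i, W_i) \big] \\
   &= \sum_{i=1}^n I(Y_i; W_i | X_i) \\
   &\ge \sum_{i=1}^n R_{Y|X}\big( \mathbb{E}\,\ell(W_i, Y_i), P \big) \\
   &\ge n R_{Y|X}\big( \mathbb{E}\,\ell_n(W^n, Y^n), P \big),
\end{aligned}
\]

where (A.1) follows from the fact that $W^n$ is a function
of $\hat{Y}^n$ and $X^n$. The remaining steps follow from standard
information-theoretic identities and from convexity. Therefore,

\[ \liminf_{n \to \infty} \mathbb{E}\,\ell_n(W^n, Y^n) \ge D_{Y|X}(R, P). \]

Because $\mathbb{E}\, L(\hat{f}_n, P) = \mathbb{E}\,\ell_n(W^n, Y^n) + o(1)$ by the ULLN,

\[ \liminf_{n \to \infty} \mathbb{E}\, L(\hat{f}_n, P) \ge D_{Y|X}(R, P). \tag{A.2} \]

Now, given any $f \in \mathcal{F}$, we can interpret $f(X)$ as a zero-rate
approximation of $Y$ (using only the side information $X$), so
$L(f, P) \ge D_{Y|X}(0, P) \ge D_{Y|X}(R, P)$ for any $R \ge 0$. In
particular, $L^*(\mathcal{F}, P) \ge D_{Y|X}(R, P)$ for all $R$, and

\[ \liminf_{n \to \infty} \mathbb{E}\, L(\hat{f}_n, P) \ge L^*(\mathcal{F}, P) \ge D_{Y|X}(R, P) \]

for all $R$. Thus, the information-theoretic lower bound (A.2)
is weaker than the bound
$\liminf_{n \to \infty} \mathbb{E}\, L(\hat{f}_n, P) \ge L^*(\mathcal{F}, P)$.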